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Abstract 

This classroom quasi-experiment aimed to learn if and to what degree supplementing 
classroom instruction with Rosetta Stone (RS), Tell Me More (TMM), Memrise (MEM), or 
ESL WOW (WOW) impacted high-stakes English test performance in areas of university- 
level writing, reading, speaking, listening, and grammar. Seventy-eight (iV = 78) Chinese 
learners of English in a cross-border higher education program with a Midwestern U.S. 
state comprehensive university made up the participants. From the pretest/posttest 
repeated-measures design, quantitative results included (1) no practice-and-drill 
software group (RS, TMM, and MEM) significantly outperforming the reference software 
group (WOW) although statistically significant improvements appeared among 
subsections of the test and among test totals; (2) no significant correlation between 
number of hours spent using technology and posttest scores; and (3) statistically 
significant differences between female and male participants in all groups in the areas of 
pretest scores, posttest scores, and total number of hours using language-learning 
software over the study's 15-week period. These findings offer insight into supplementing 
language classroom instruction with language learning software in Mainland China, with 
implications for language teachers across Asia. 

235,597 Chinese students (28.7 percent of all international pupils) enrolled in U.S. 
universities for the 2012-2013 academic year, a 21 percent increase from the previous 
academicyear (Institute of International Education, 2013). Increasingly, Chinese learners 
begin or continue U.S.-based higher education while never having to leave China. This, 
along with available literature on the interrelations of cross-border higher education 
(CBHE), English performance on high-stakes standardized tests, and students’ money¬ 
making potential in China, explains the need for sustained attention Chinese learners 
deserve from institutions of higher learning. 

In China, a competitive job market underscores the importance of higher education and 
particularly of English test performance. Employers in China view a job applicant’s 
academic-performance record as a device for signaling ability and screening out 
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unqualified applicants (Wang & Morgan, 2009), and young workers in China with 
advanced skills have experienced jobless rates ofapproximately4.1%in a context where 
roughly 8% are reported as unemployed (Orlik, 2012). After the economic crisis of 2009, 
jobs in China became harder to find for university graduates (Qi, 2011), and the reputation 
of the university related to the amount of money graduates with bachelor degrees tended 
to earn, correlating with 25-30% higher wages for students with degrees from top-100 
Chinese universities and with 15-20% higher wages for graduates from top-200 schools 
(Hartog, Sun, & Ding, 2010). Students from the most prestigious Chinese schools settled 
least frequently for jobs they did not want and suffered least frequently from 
dissatisfaction from over-education (Li, Zhao, & Tian, 2010). Chinese learners who cannot 
matriculate into top-tier universities, however, can still benefit from performance on high - 
stakes standardized English-language tests. Good grades in conjunction with "a National 
Standard English Certificate" contribute to Chinese graduates finding the jobs they want 
and making more money (Zhao, 2009). 

For years, teachers, researchers, administrators, and government officials in China have 
looked to computer-assisted language learning (CALL) to boost learners’ performance of 
English on standardized tests. They have pointed out the growing interest of CALL and the 
need for teachers to develop learners’ English in response to growing access to technology 
in China (Li, 2007), with Ruan (2008) specifically calling for classroom-based research to 
explore local issues related to Chinese learners of English and CALL. 

This classroom-based quasi-experiment used a quantitative repeated-measures design to 
address a local problem at the research site, hoping to skirt the unwitting duplication of 
effort that Shield (2009) warned against when hypotheses and designs previously tested 
on obsolete technology reappear in publications on newer, more popular technology. 
Specifically, this study aimed to learn if and to what degree supplementing classroom 
instruction with Rosetta Stone (RS), Tell Me More (TMM), Memrise (MEM), or ESL WOW 
(WOW) impacted English performance on a high-stakes standardized test meant to 
measure writing, reading, speaking, listening, and grammar of Chinese learners of English 
in a CBHE context in Mainland China. 

So far, researchers have focused only on RS and TMM. Peer -reviewed studies criticizing RS 
from theoretical standpoints have identified erroneous theoretical underpinnings, 
culturally inappropriate images, and little connection to learners’ lives (DeWaard, 2013; 
Lopez-Lopez, 2012). Among the few available participant-based studies, Nielson’s (2011) 
study explored how adult learners used both RS and TMM and, to the chagrin of the 
researcher, found huge attrition among participants and negligible benefit for learners. 
Meanwhile, Demir and Korkmaz (2013) reported significant changes among pre- and 
posttest performance measures for listening and speaking after 21 hours of RS use versus 
traditional teaching-English-as-a-foreign-language (TEFL) classroom instruction, while a 
participant-based study on learner perceptions of TMM reported that learners perceived 
the software as helpful for learning English (Hashim & Yunus, 2010). 

Overall, deficiencies in the evidence appear on two levels. From a research-based point of 
view, few participant-based studies have looked at RS or TMM, while none have looked at 
MEM or WOW. Of the existing research on RS and TMM, seemingly nobody has situated 
research objectives into an experimental design or had access to this present study’s kind 
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of research site (Chinese CBHE) or participants. In addition, although Hashim and Yunus 
(2010) measured learner perceptions of TMM, the study failed to report how much 
participants used the technology during the study. Deficiencies in the evidence appear 
also from a practice-based point of view. While previous research has looked at RS and 
TMM as stand-alone packages, nobody has studied the impact of these programs as 
supplements to classroom instruction or in comparison with the impact of alternative 
software. 

In light of these areas where knowledge can be built upon, the following research question 
guided the present study: How do Rosetta Stone, Tell Me More, Memrise, and ESL WOW 
impact the English performance on a high-stakes standardized test meant to measure 
reading, writing, listening, speaking, and grammar? 

Literature Review 

To justify the need for this study, the following literature review reiterates and further 
details places of potential growth in knowledge related to Chinese learners of English and 
to the language-learning programs used here. 

Rosetta Stone, Tell Me More, Memrise, and ESL WOW 

Few peer-reviewed scholarly articles have explored RS and TMM, while no article has 
examined either MEM or WOW. Using Cummins, Brown, and Sayers’s (2007) framework 
for evaluating software in schools, Lopez-Lopez (2012) pointed out perceived 
shortcomings of RS, including, (a) a lack of opportunities for critical thinking or cognitive 
challenge; (b) a failure to clarify deep connections between vocabulary, grammar, and the 
learners’ lives; (c) a limit on collaborative inquiry in the online games and practice 
activities; (d) a gap in any kind of engaging reading or writing practice across the 
curriculum; and (e) no chance to develop effective reading or writing strategies. Still, the 
program appeared to offer opportunities for "effective involvement and identity 
investment" through collaboration with other students and instructors online (Lopez- 
Lopez, 2012, p. 7). DeWaard (2013) likewise evaluated RS in light ofthe company’s push 
to enter academia and concluded that the software package was "not a viable option for 
language learning” (p. 61), citing unclear theoretical underpinnings, culturally 
inappropriate images, and nonstandard proficiency levels. Nielson (2011) explored how 
adult learners used both RS and TMM and, consistent with literature on the 
ineffectiveness of self-study alone (Benson, 2007; Holec, 1981), found significant 
attrition: Ofthe more than 300 participants, only5 completed the full studyprotocol,with 
half of the participants never accessing the software once and only a handful benefiting 
from the study. Hashim and Yunus (2010) surveyed the perceptions learners of English 
held regarding the usefulness of TMM; most participants thought the tool was easy to use, 
almost all felt it helped them improve their English performance, and most thought it was 
suitable for language learning. Still, the survey design failed to either measure or report 
how much participants used the software or whether participants showed improvement 
on performance measures. In contrast, this present study sought to measure both the 
amount of time learners used the software and the extent to which participants’ mean 
scores on pre- and posttests changed over the duration of the study. 
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Chinese English Language Learners and CALL 

A significant amount of previous research on Chinese English language learners (ELLs) 
has explained learner tendencies and representative English-language settings in China. 
Previous studies have emphasized the importance of instrumental motivation for Chinese 
ELLs, with learners’ goals including scoring favorably on high-stakes exams, winning 
scholarships, receiving praise from teachers, earning higher grades, or landing good jobs 
(Li, 2014; Liu, 2012; Ning & Hornby, 2013; Peng & Woodrow, 2010; Zhang & Guo, 2012; 
Zheng, 2012). Earlier research also has highlighted anxiety as affecting Chinese ELLs 
motivated learning behavior (Li, 2014) and willingness to speak English in class (Li, 2014; 
Liu, 2006; Liu & Jackson, 2011). Though some research indicated that female Chinese 
participants tended to report higher motivation to learn English than males (Lamb, 2004; 
Liu, 2009, 2012; Yang, Liu, & Wu, 2010), biological sex of Chinese ELLs did not seem to 
predict significant differences in English proficiency or language-learning strategy use in 
other research (Nisbet, Tindall, & Arroyo, 2005). As Zhou and Xu (2012) found, although 
currently in China adolescents have more freedom to choose college majors than earlier 
generations, those whose universities selected majors for them tended to have more 
negative attitudes toward learning, less inclination to pick up new learning strategies, and 
poorer academic performance. This lack of choice may extend to Chinese teachers as well. 
You (2004) found overworked, institutionally-restricted Chinese English writing 
instructors typically followed a current-traditional approach, emphasizing correct form 
over thought development to not prepare students for communication, but to ready 
learners for high-stakes exams. 

Alongthese lines, Chinese ELLs have been described as remaining accustomed to test- and 
teacher-centered classrooms where lectures dominate (Li, 2014; Lu, Li, & Du, 2009) and 
where a focus on receptive knowledge of English vocabulary and prescriptive grammar 
sometimes prevents active, productive use or communication (Ma, 2012; Peng & 
Woodrow, 2010; Zheng, 2012). As a result of compulsory learning of English, Chinese 
learners may not enjoy learning English (Ning & Hornby, 2013). In addition, other 
research has pointed out that, in spite of years of studying English, many Chinese students 
perform poorly on speaking measures (Peng, 2014). With these challenges in mind, many 
have looked to CALL as a way to motivate students and boost Chinese ELLs ’ performance 
on tests and elsewhere. 

Lu, Li, and Du (2009) discussed the Chinese Department of Higher Education’s call for 
more student-centered, communicative language teaching (CLT) approaches in English 
language classes, especially through the incorporation of CALL. A perceived lack of 
speaking and listening skills inspired these directives, with Liao (2004) arguing that the 
government’s position of embracing CLT would "bring about a positive effect on English 
teaching and learning” (p. 272). Still, Niu-Cooper (2012) noted challenges regarding 
change in China, including Chinese teachers’ habitually teacher-centered styles and 
crowded classrooms thatended up restricting pair and group work. Regarding CLT-guided 
CALL, Li (2007) listed more challenges to change in Chinese English-language-learning 
contexts, such as scarcity of reliable hardware and software, teachers’ lack of training in 
CALL who remained reluctant to change teaching styles, and teachers’ uncertainty about 
howto use CALL materials. 
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Previous CALL research in China has implored teachers of English to develop relevant 
skills in teaching with technology (Li, 2007). While warning researchers in China not to 
carry out CALL-related studies simply to tout a popular theory, Ruan (2008) called for 
classroom-based designs to explore how best to use CALL in existing classrooms. Past 
studies have illustrated significant gains in speaking performance, including varied 
vocabulary and more complex syntax, after e-learning treatment in English-language 
speaking classes in China (Shen & Suwanthep, 2011). Zou (2011) also found that, given 
teacher training and reliable hardware, CALL in China fostered learner autonomy. 

lustification for the Present Study 

The above literature justifies the present study in two main ways: (a) This study sought to 
give much-needed attention to popular language-learning programs that are increasingly 
entering academia and, also, that have attracted little to no attention in the way of 
participant-based studies, and (b) this study explored whether supplementing classroom 
instruction with popular language-learning software significantly impacted language 
performance on a high-stakes language test, offering practical guidance for administrators 
and instructors involved with and invested in the research site. 

Methods 

Research Question 

The researcher investigated the following research question: Ho wdoes Rosetta Stone, Tell 
Me More, Memrise, and ESL WOW impact the English performance on a high-stakes 
standardized test meant to measure reading, writing, listening, speaking, and grammar? 

Participants 

Seventy-eight [N - 78; 43 females, 35 males) Chinese ELLs between the ages of 20-22 
studying at a private Chinese college in a cross-border program with a Midwestern U.S. 
state-comprehensive university (SCU) made up the participants in this study. The 
researcher used a nonprobability sampling approach of convenience representative of 
classroom-based research before the semester began and randomly identified language - 
learning software for each of his four sections of English writing: RS: n - 19 (5 females, 14 
males); TMM: n- 19 (14 females, 5 males); MEM: n = 21 (12 females, 9 males); and WOW 
(n = 19 (13 females, 6 males). All participants spoke Mandarin Chinese. Students were 
placed into the cross-border program based on scores earned on the Chinese university’s 
entrance exam. The researcher’s university Internal Review Board approved the research 
and data-collection measures. Permission was obtained from the Chinese partner 
university, and all participants gave informed consent in Mandarin. 

Materials 

Rosetta Stone. Stoltzfus (1997) identified the theoretical underpinnings of RS as the 
comprehension or natural theory, in which second language acquisition (SLA) starts with 
listening comprehension, mimicking the process of first language acquisition (FLA). 
Lopez-Lopez (2012) noted that the program melded text, images, and sound at student- 
appropriate levels while describing "immersion" in RS as involving "intuition, interaction, 
and instruction" (p. 2). Accordingly, learners view an image with English text and then 
hear and are prompted to repeat or explain the stimulus into a microphone, allowing a 
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speech-recognition tool to measure comprehensibility. Drawing on personal experience 
in using the package to learn German, Hiebert (2012) wrote that RS supported SLA more 
effectively than the text-image coupling in children’s books by providing "critical and 
consistent data without substantial amounts of diverting information" (p. 291). Learners 
can chat with other RS learners who happen to be online and can also eventually talk with 
an RS tutor. 

Tell Me More. TMM includes more drilling in grammar and vocabulary than RS and also 
seems to feature a more sensitive speech-recognition tool. TMM involves fill-in-the-blank 
exercises and puzzles and gives learners the choice of focusing on everyday English or on 
business/professional English. In a description of TMM, Brynko (2008) noted, "six 
workshop activities: culture (geography lessons and city tours), written (practice your 
prose), vocabulary and grammar (fine-tune yourskills), oral(speakthe language), lessons 
(activities from matching words to multiple choice), and dialogue (using speech 
recognition)” (p. 41). During the semester in which the researcher was carrying out the 
project, RS acquired TMM. 

Memrise. MEM is a free online learning tool with user-created classes f memrise.com L In 
language classes, learners mostly match words, phrases, or sentences with definitions or 
translations. If the learner forgets the meaning of the target form, he or she may review 
an image that the user had earlier selected or created to correspond with that target form. 
Learners also may hear voice recordings of English speakers repeating target forms. 
Learners earn points for completing levels, and usernames get poste don weekly, monthly, 
and all-time leaderboards in real time in competition with users worldwide. Learners 
produce language by typing answers into blanks while a timer clicks toward zero. 

ESL Writing Online Workshop (ESL WOW). Essentially a reference website explaining 
the process of writing, WOW features animations of a student getting help from a writing - 
centertutor ( esl-wow.org ). Unlike the other practice-and-drill tools above, this online tool 
did not offer self-paced routes of learning or any recorded assessment. Instead, learners 
watched animations and listened to dialogs centered on English composition topics, such 
as "Getting Ready to Write,” "Developing Your Ideas,” "Revising Your Work,” and "Editing 
and Polishing." Learners had no opportunities to speak or write language or to engage 
with or manipulate the videos aside from pausing or replaying. 

Assessments. A version of a placement exam used at the researcher’s U.S. university 
served as the pre- and posttest. The exam included, (a) two 30-minute reading sections 
with a total of 15 questions related to two 300-word passages; (b) a 45-minute writing 
section that asked participants to compose an argument in the form of an essay; (c) a 30 - 
minute listening section of 20 questions in response to a 3-minute audio dialog; (d) an 
approximately 5-to-7-minute speaking section of a one-on-one conversation with the 
researcher; and (e) a 30-minute grammar section of 50 discrete-item and integrative 
questions. 

To measure amount of time learners spent using language-learning software each week, 
the researcher made use of built-in assessments in RS and TMM, as well as the university’s 
learning management system (LMS), through which participants gained access to the 
software. 
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Procedures 


In the previous semester, the researcher piloted the quasi-experiment. At the start of the 
15-week semester that made up this study’s duration, participants met the researcher in 
a computer lab. Those who gave informed consent (N - 78) completed surveys as well as 
pretests that measured reading, writing, speaking, listening, and grammar. Throughout 
the semester, learners were asked to use the software at least three hours per week and 
to record progress in weekly, LMS-hosted electronic journals. At the end of the semester, 
participants retook surveys and completed the posttest. The participants were debriefed, 
thanked, and added to an email list that would let them know when the study might be 
either published or otherwise reported. 

Results 

To answer the research question, the researcher ran, (a) outlier tests, Shapiro -Wilk tests 
of normal distribution, and Levene’s test for homogeneity of variances; (b) Welch ANOVA 
and Games-Howell post hoc tests to check for significant variance between groups; (c) 
paired t-tests on pre- and posttests, followed up by nonparametric Wilcoxon’s tests; and 
finally, (d) Pearson product-moment correlation tests on the number of hours participants 
used language-learning software and posttest scores. The researcher used IBM SPSS 
Statistics version 21 to analyze data. 

Outliers, Normal Distribution, and Homogeneity of Variances 

Descriptive results showed outliers in the RS group. Cook’s distance regression analysis 
(£>) indicated that RS observation 1 held (D = 0.101) and RS observation 18 showed (D = 
0.042). The researcher did not eliminate outliers’ scores from the data set, opting instead 
for robust statistical analyses and noting this limitation common in classroom-based 
research and convenience sampling. Shapiro-Wilk’s test for normal distribution showed 
that, overall, groups’ scores demonstrated normal distribution at the p > 0.05 level: RS 
(p = 0.75), TMM (p = 0.782), MEM (p = 0.268), and WOW (p = 0.092); however, evidence 
of normal distribution appeared to be lacking in sections of the tests for some groups. 
Finally, Levene’s test for homogeneity of variances found a lack of homogeneity at the p > 
0.05 level between the variances in the population (p = 0.028), indicating that variability 
of scores did not remain equal across the four groups and that equal variances among 
groups’ pretest scores could not be assumed. 

In conclusion, (a) significant outliers appeared in the RS group’s pretest but remained in 
later analyses (accounted for with robust tests); (b) significant normal distribution 
appeared among groups on pretests though a lack of normal distribution appeared among 
sections of the test for some groups; and (c) variability of scores did not remain equal 
across the groups. In response to these initial findings on pretest scores, the researcher 
selected the Welch ANOVA instead ofone-wayANOVAtest, Games-Howell posthoc instead 
of Tukey’s posthoc, and followed up t-tests with Wilcoxon’s nonparametric test. 

Welch ANOVA and Games-Howell Test on Pretest Scores 

Regarding the extent groups’ performance on pretests resembled each other, Welch 
ANOVA tests showed that, although groups' pretest scores exhibited no significant 
difference among one another in reading, listening, and grammar, significant differences 
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appeared atthe p < 0.05 level in writing (p = 0.001), speaking (p < 0.001), and totals (p < 
0.001). Games-Howell post hoc tests illuminated which groups differed and to what 
degree. RS pre-writing (M = 54.21, SD = 11.04) proved significantly lower than MEM pre- 
writing (M = 64.86, SD = 6.89) with significance atthe p < 0.05 level of (p =0.006) and also 
proved significantly lower than WOW pre-writing (M = 69.16, SD = 8.80) (p < 0.001). RS 
pre-speaking (M = 50.00, SD = 12.25) proved significantly lower than TMM pre-speaking 
(M = 82.90, SD = 12.62) (p < 0.001), WOW pre-speaking (M = 73.16, SD = 13.66) (p < 
0.001), and MEM pre-speaking (M = 64.29, SD = 8.26) (p = 0.001). Atthe same time, MEM 
pre-speaking (M = 64.29, SD = 8.26) proved significantly lower than TMM pre-speaking 
(M = 82.90, SD = 12.62) (p < 0.001). Regarding total scores, RS pretest (M = 53.00, SD = 
8.56) proved significantly lower than both TMM pretest (M = 65.84, SD = 7.65) (p < 0.001) 
and WOW pretest (M = 62.47, SD = 10.47) (p = 0.022) but not significantly lower than 
MEM pretest (M = 57.71, SD = 5.41). In turn, MEM pretest (M = 57.71, SD = 5.41) proved 
significantly lower than TMM pretest (M = 65.84, SD = 7.65) (p = 0.003). 

In summary, the groups showed no significant variance regarding their reading, listening, 
and grammar pretest scores. Still, significant variance appeared elsewhere: (a) RS scored 
significantly lower than MEM and WOW on writing, significantly lower than all other 
groups on speaking, and significantly lower than TMM and WOW in total scores; (b) MEM, 
in turn, scored significantly lower than TMM on speaking and significantly lower than 
TMM on the pretest total. Pretest performance before treatment appeared in this order: 
TMM (65.8%), WOW (62.5%), MEM (57.7%), and RS (53%). 

Finally, Welch ANOVA tests indicated significant difference between pretest means of 
female participants (M = 62, SD = 8.23) and pretestmeans ofmale participants (M = 56.74, 
SD = 10), with significance atthe p < 0.05 level of (p = 0.015). 

Paired Samples Analyses 

To measure for difference between pre- and posttest means, the researcher ran analyses 
on paired samples. Table 1 below describes paired samples from pre - and posttests for all 
groups. 

Table 1. Groups’ Paired Samples Statistics 


Pairs 

Mean 

N 

SD 

Std. Error Mean 

RS pre-Reading 

45.895 

19 

16.248 

3.728 

RS post-Reading 

44.842 

19 

18.446 

4.232 

TMM pre-Reading 

49.105 

19 

13.212 

3.031 

TMM post-Reading 

55.158 

19 

12.562 

2.882 

MEM pre-Reading 

39.857 

21 

14.578 

3.181 

MEM post-Reading 

44.381 

21 

14.596 

3.185 

WOW pre-Reading 

48.316 

19 

19.096 

4.381 

WOW post-Reading 

47.421 

19 

21.122 

4.846 
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Pairs 

Mean 

N 

SD 

Std. Error Mean 

RS pre-Writing 

54.210* 

19 

11.038 

2.532 

RS post-Writing 

63.260* 

19 

11.416 

2.619 

TMM pre-Writing 

63.632* 

19 

10.935 

2.509 

TMM post-Writing 

68.211* 

19 

11.989 

2.750 

MEM pre-Writing 

64.857* 

21 

6.887 

1.503 

MEM post-Writing 

72.429* 

21 

6.454 

1.408 

WOW pre-Writing 

69.158* 

19 

8.802 

2.019 

WOW post-Writing 

72.053* 

19 

6.329 

1.452 

RS pre-Listening 

54.470 

19 

12.681 

2.909 

RS post-Listening 

49.740 

19 

17.910 

4.109 

TMM pre-Listening 

63.421 

19 

13.443 

3.084 

TMM post-Listening 

65.790 

19 

11.698 

2.684 

MEM pre-Listening 

53.619 

21 

11.205 

2.445 

MEM post-Listening 

54.048 

21 

10.796 

2.356 

WOW pre-Listening 

52.632- 

19 

17.270 

3.962 

WOW post-Listening 

59.474- 

19 

15.714 

3.605 

RS pre-Speaking 

50.000 

19 

12.247 

2.810 

RS post-Speaking 

51.580 

19 

12.478 

2.863 

TMM pre-Speaking 

82.895 

19 

12.618 

2.895 

TMM post-Speaking 

81.579 

19 

14.342 

3.290 

MEM pre-Speaking 

64.286* 

21 

8.259 

1.802 

MEM post-Speaking 

68.810* 

21 

9.605 

2.096 

WOW pre-Speaking 

73.158* 

19 

13.664 

3.135 

WOW post-Speaking 

77.368* 

19 

11.828 

2.714 

RS pre-Grammar 

60.790- 

19 

13.811 

3.168 

RS post-Grammar 

44.840- 

19 

18.446 

4.232 

TMM pre-Grammar 

70.000 

19 

12.065 

2.768 

TMM post-Grammar 

69.790 

19 

11.703 

2.685 

MEM pre-Grammar 

65.952 

21 

12.343 

2.694 

MEM post-Grammar 

68.381 

21 

10.509 

2.293 
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Pairs 

Mean 

N 

SD 

Std. Error Mean 

WOW pre-Grammar 

69.105 

19 

11.907 

2.732 

WOW post-Grammar 

71.368 

19 

11.908 

2.732 

RS pretest TOTAL 

53.000 

19 

8.564 

1.965 

RS posttest TOTAL 

54.620 

19 

11.400 

2.615 

TMM pretest TOTAL 

65.842 

19 

7.654 

1.756 

TMM posttest TOTAL 

68.000 

19 

8.901 

2.042 

MEM pretest TOTAL 

57.714* 

21 

5.414 

1.182 

MEM posttest TOTAL 

61.571* 

21 

5.250 

1.146 

WOW pretest TOTAL 

62.474* 

19 

10.474 

2.403 

WOW posttest TOTAL 

65.526* 

19 

9.571 

2.196 

*Significantly better score (paired t -testanc 

-Significant on paired t-test but nominally ii 

~Significantly worse score 

Wilcoxon’s test) 

isignificant on Wilcoxon’s test 


Table 1 shows the extent to which all groups improved on posttests. Closer analysis, 
though, determined whether changes between pre - and posttest scores proved significant 
(see Table 2). 

Table 2. Groups' Paired Samples t-tests 
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Paired Differences 



t 

df 

Sig. (2- 









tailed) 

(*p<0.05) 

Pairs 

Mean 

SD 

Std. 

95% Confidence 






Error 

Interval 

of the 






Mean 

Difference 








Lower 

Upper 




TMM 

-4.579* 

5.368 

1.231 

-7.166 

-1.992 

- 

18 

.002* 

Writing 






3.718 



MEM 

-7.571* 

7.763 

1.694 

-11.105 

-4.038 

- 

20 

.000* 

Writing 






4.470 



WOW 

-2.895* 

4.795 

1.100 

-5.206 

-.584 

- 

18 

.017* 

Writing 






2.632 



RS 

Listening 

4.737 

18.141 

4.162 

-4.007 

13.480 

1.138 

18 

.270 

TMM 

-2.368 

9.771 

2.242 

-7.078 

2.341 

- 

18 

.305 

Listening 






1.057 



MEM 

Listening 

-.429 

12.464 

2.720 

-6.102 

5.245 

-.158 

20 

.876 

WOW 

-6.842* 

14.163 

3.250 

-13.668 

-.016 

- 

18 

.050* 

Listening 






2.106 



RS 

-1.579 

6.248 

1.433 

-4.590 

1.432 

- 

18 

.285 

Speaking 






1.102 



TMM 

Speaking 

1.316 

9.696 

2.224 

-3.357 

5.989 

.592 

18 

.562 

MEM 

-4.524* 

5.456 

1.191 

-7.007 

-2.041 

- 

20 

.001* 

Speaking 






3.800 



WOW 

-4.211* 

7.502 

1.721 

-7.827 

-.595 

- 

18 

.025* 

Speaking 






2.446 



RS 

Grammar 

15.947* 

20.791 

4.770 

5.926 

25.968 

3.343 

18 

.004* 

TMM 

Grammar 

.211 

7.656 

1.757 

-3.480 

3.901 

.120 

18 

.906 

MEM 

-2.429 

9.796 

2.138 

-6.888 

2.030 

- 

20 

.269 

Grammar 






1.136 



WOW 

-2.263 

5.646 

1.295 

-4.984 

.458 

- 

18 

.098 

Grammar 






1.747 



RS TOTAL 

-1.620 

4.910 

1.126 

-3.986 

.746 

- 

18 

.168 







1.438 
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Summary of Significant Improvement 

Reading. Analysis of pre- and posttest means revealed that no group significantly 
improved its reading scores; however, statistically insignificant improvements appeared 
between all groups’ pre- and posttest mean scores. Wilcoxon’s signed rank test confirmed 
a lack of statistically significant improvement. 

Writing. Analysis of pre- and posttest means revealed that all groups significantly 
improved at the p < 0.05 level in writing. RS pre-writing (M = 54.21, SD = 11.04) differed 
significantly fromRS post-writing (M = 63.26, SD = 11.42); t(18)=-4.67, p < 0.001. MEM 
pre-writing (M = 64.86, SD = 6.89) differed significantly from MEM post-writing (M = 
72.43, SD = 6.45); t(20)=-4.47, p < 0.001. TMM pre-writing (M = 63.63, SD = 10.94) 
differed significantly from TMM post-writing (M = 68.21, SD = 11.99); t(18)=-3.72, p = 
0.002. Finally, WOW pre-writing (M = 69.16, SD = 8.80) differed significantly from WOW 
post-writing (M = 72.05, SD = 6.33); t(18)=-2.63, p = 0.017. Wilcoxon’s signed rank test 
confirmed significant improvements at the p < 0.05 level: RS (p = 0.001), MEM (p = 0.001), 
TMM (p = 0.006), and WOW (p = 0.022). 

Listening. Analysis of pre- and posttest means revealed that only the WOW group 
improved its listening score. WOW pre-listening (M = 52.63, SD=17.27) differed 
significantly from WOW post-listening (M = 59.47, SD = 15.71); t(18)=-2.11, p = 0.05; 
however, Wilcoxon’s signed rank test showed nominally insignificant change at the p < 
0.05 level (p = 0.051). TMM and MEM both made statistically insignificant improvements 
while RS performed worse on the listening posttest, though not significantly so. 
Wilcoxon’s signed rank test confirmed these findings. 

Speaking. Analysis of pre- and posttest means revealed that only MEM and WOW 
significantly improved on speaking. MEM pre-speaking (M = 64.29, SD = 8.26) differed 
significantly from MEM post-speaking (M = 68.81, SD = 9.61); t(20)=-3.80, p = 0.001. 
WOW pre-speaking (M = 73.16, SD = 13.66) differed significantly from WOW post¬ 
speaking (M = 77.37, SD = 11.83); t(18)=-2.45, p = 0.025. Statistically insignificant 
improvements appeared from RS and TMM. Wilcoxon’s signed rank test confirmed 
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significant improvements atthe p < 0.05 level, MEM (p = 0.002) and WOW (p = 0.035), as 
well as statistically insignificant improvements of RS and TMM. 

Grammar. Analysis of pre- and posttest means revealed that no group significantly 
improved its grammar scores; however, statistically insignificant improvements appeared 
from MEM, WOW, and TMM. RS scored significantly lower on grammar posttests, with RS 
pre-grammar (M = 60.79, SD = 13.81) differing significantly fromRS post-grammar (M = 
44.84, SD = 18.45); t(18)=3.34, p =0.004. Reasons for this probablyinclude, (a) RS group’s 
starting out at a lower point at the beginning of the semester; (b) RS group’s spending 
significantly less time using the technology compared with other groups (to be discussed 
in detail below); and (c) RS group’s higher percentage of male participants, who scored 
significantly lower than female participants in the study (to be discussed in detail below). 
Wilcoxon’s signed rank test confirmed the RS group’s significant change on grammar 
performance atthe p < 0.05 level (p = 0.005). 

Total scores. Analysis of pre- and posttest means revealed significant improvement from 
MEM and WOW. MEM pretest (M = 57.71, SD = 5.41) differed significantly from MEM 
posttest (M = 61.57, SD = 5.25); t(20)=-4.11, p = 0.001. WOW pretest (M = 62.47, SD = 
10.47) differed significantly from WOW posttest (M = 65.53, SD = 9.57); t(18)=-3.68, p = 
0.002. Insignificant improvements came from RS and TMM. Wilcoxon’s signed rank test 
confirmed significant changes atthe p < 0.05 level, MEM (p =0.001) and WOW (p = 0.003), 
as well as statistically insignificant improvements of RS and TMM. 

Welch ANOVA tests indicated significant difference between posttest means of female 
participants (M = 65.05, SD = 8.57) and posttest means of male participants (M = 58.99, 
SD = 11.08), with significance atthe p < 0.05 level of (p = 0.008). 

Posttest Scores and Time Spent Using Language-Learning Software 

To understand how time using language-learning software related to participants’ test 
scores, the researcher checked for normal distribution, checked for homogeneity of 
variances, ran one-way ANOVA tests, ran Tukey’s post hoc test, and finally ran Pearson 
product-moment correlation tests. 

Shapiro-Wilk tests showed groups’ means of hours using technology exhibited significant 
normal distribution at the p > 0.05 level: RS (p = 0.714), TMM (p = 0.331), MEM (p = 
0.703), WOW (p = 0.188). Next, Levene’s test of homogeneity of variances found 
significant homogeneity at the p > 0.05 level (p = 0.112). This measurement showed that 
variability of mean hours of using language-learning software remained equal across the 
four groups and that significantly equal variances could be assumed. Table 3 describes 
hours participants spent using language-learning software. 


Table 3. Hours Using Technology by Group 



N 

Mean 

SD 

Min. 

Max. 

RS 

19 

27:13 

16:31 

4:08 

62:32 

TMM 

19 

39:10 

17:00 

12:44 

65:26 
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N 

Mean 

SD 

Min. 

Max. 

MEM 

21 

36:56 

19:00 

12:13 

82:00 

WOW 

19 

20:13 

12:43 

4:00 

60:58 

Total 

78 

31:23 

17:35 

4:00 

82:00 


Still, one-way ANOVA tests on hours participants used software in the study evidenced 
significant differences among means at the p< 0.05 level [F(l, 76) = 11.43, p - 0.001)]. 
Looking more closely, Tukey post hoc tests showed where significant variation occurred: 
Both TMM (+19 hours, p - 0.002) and MEM (+16 hours, p = 0.008) used the technology 
significantly more than WOW. In spite of that, no group’s mean met the targeted goal of 
45 hours total, 3 hours per week for the 15 total weeks that made up the duration of the 
study. 

Finally, Welch ANOVA tests indicated significant difference between the means of the total 
number of hours that female participants used the technology (M = 36:30, SD = 19:36) 
and the means of the total number of hours male participants used the technology (M = 
24:14, SD = 12:30), with significance at the p < 0.05 level of (p = 0.001). 

Correlation Between Hours Using Technology and Posttest Scores 

Pearson product-moment correlation tests measured if and to what level of significance 
the amount of time learners spent using language-learning software related to posttest 
scores. Results showed no significant correlation overall between total hours using 
language-learning software and reading, writing, listening, speaking, grammar, or posttest 
totals, indicating that the increase in overall software use did not correlate with increased 
scores. 

In the RS, TMM, MEM, and WOW groups individually, Pearson product-moment 
correlation tests also showed that total hours using the software did not correlate with 
posttest scores in any section of the test or test totals. 

Summary of Findings 

The WOW group showed significant improvement in three of the five skills (i.e., writing, 
listening, and speaking) and significant improvement overall. The second most successful 
software appeared to be MEM, whose group showed statistically highly significant 
improvement in writing and significant improvement overall. Finally, the RS and the TMM 
groups showed statistically significant improvement in writing but not overall. 


Table 4. Summary of Findings 


Group 

M Hours 

Reading 

Writing 

Listening 

Speaking 

Grammar 

TOTAL 

RS 

27:13 


P 

<.001 
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Group 

M Hours 

Reading 

Writing 

Listening 

Speaking 

Grammar 

TOTAL 

TMM 

39:10 


P 

= .002 





MEM 

36:56 

— 

P 

— 

P 

— 

P 




<.001 


= .001 


= .001 

WOW 

20:13 

— 

p 

P 

P 

— 

P 




= .017 

= .05* 

= .025 


.002 


* = Wilcoxon’s test showed nominally insignificant improvement (p = 0.051). 


Discussion 

The researcher investigated the following research question: How do Rosetta Stone, Tell 
Me More, Memrise, and ESL WOW impact the English performance on a high-stakes 
standardized test meant to measure reading, writing, listening, speaking, and grammar? 

Although RS and TMM groups both improved only on writing, and although the RS group’s 
writing improvement proved statistically highly significant, the RS group's starting out at 
a lower level seems to have offered more room for fundamental improvement. Still, at least 
among this group of learners, RS failed to motivate participants to approach the target of 
3 hours per week, 45 hours total over the 15-week duration of the study. This finding 
seems in line with Nielson’s (2011) findings, which showed participants using RS much 
less than anticipated and receiving little overall benefit. These findings also seem 
consistent with previous criticisms of RS. For instance, Lopez-Lopez (2012) argued that 
RS lacked cognitive challenge, offered weak connections between vo cabulary, grammar, 
and the learners’ lives, did not allow for engaging collaboration, offered no real reading or 
writing practice across the curriculum, and ignored opportunities for learners to build 
reading or writing strategies. The RS group’s relatively infrequent use of the software 
might have resulted from a perception that the program lacked connections to learners’ 
lives; indeed, Zheng (2012) found that Chinese learners of English lost interest in areas of 
English learning not emphasized in class. The findings of this study also seem consistent 
with DeWaard’s (2013) evaluation that, in spite of its self-marketing as a one-stop solution 
for language learning, RS fails to represent "a viable option for language learning" (p. 61). 
In contrast, the amount of time TMM learners used the software seems inconsistent with 
Nielson’s (2011) findings, in which participants mostly failed to use TMM according to 
protocol. 

It seemed that TMM and MEM garnered the most attention from participants. Perhaps 
TMM’s optional focus on business or everyday English, as well as the level of online 
instruction matching participants’ scores on TMM placement exams upon initiation of the 
program, offered more interest to learners and offered more meaningful, comprehensible 
engagement. TMM’s having face validity to learners, and in that way earning more 
attention, would be consistent with Hashim and Yunus’s (2010) survey data, in which 
learners of English reported thinking TMM was easy to use, helpful for learning English, 
and generally suitable for language learning. In addition, TMM group's improving only on 
writing might reflect findings from Zhang and Guo (2012), who concluded that as Chinese 
learners’ proficiency increased, their motivation decreased, especially intrinsic 
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motivation. The TMM group began at the highest performance level on pretests, which 
might help explain an overall lack of significant improvement. 

The MEM group improved the most in productive skills (writing and speaking). Previous 
studies have pointed out Chinese learners of English at the university level improved 
vocabulary levels most markedly during their first and second years but then plateaued 
during their third (Cui & Wang, 2006; Tan, 2006). Others have accounted for this 
stabilization by indicating pedagogical shortcomings, such as few opportunities to use 
rich vocabulary in classrooms (Laufer, 1998). The findings here indicate that sustained 
vocabulary instruction using out-of-class activities with MEM may impact both writing 
and speaking test performance among Chinese learners of university level. Regarding the 
group’s relatively frequent use of the software, it may be that learners’ ability to view 
individual achievement in relation to peers and to other learners worldwide on 
leaderboards sparked interest through competition. 

As mentioned above, the WOW group showed significant improvement in writing, 
listening, and speaking, as well as overall. Although the WOW group made perhaps the 
most comprehensive improvement, learners in this group also used the software the least 
amount of time on average. This might reflect findings from Zhang and Guo (2012), which 
indicated that as Chinese learners’ proficiency of English increased, their motivation 
decreased, especially intrinsic motivation. The fewer amount o f hours using the software, 
then, might indicate lower motivation, which has correlated in the past with higher 
proficiency among similar learners. 

Interestingly, findings in this study did not seem to confirm those of Peng (2014), who 
examined Chinese students’ perceptions of which skills posed the greatest challenges and 
who reported that Chinese learners of English believed speaking was the hardest skill to 
learn, with listening seeming to be the easiest. In this study, RS and TMM (which offered 
the most speaking opportunities of all the software) did not seem to relate to improved 
speaking test performance. 

Finally, female participants showed significant difference from male participants in three 
main areas: pretest scores, posttest scores, and average number of hours using software 
out of class during the study. These findings support earlier research indicating more 
motivated female English language learners in China (Lamb, 2004; Liu, 2009,2012; Yang, 
Liu, & Wu, 2010). 

Limitations 

Gass and Mackey (2012) identified limitations inherent in (quasi-)experimental 
classroom-based research, naming intervening variables such as frequency and intensity 
of English engagement learners experienced outside of the study. The quasi-experimental 
design used in this study also opened itself up to sampling limitations. Because 
convenience sampling makes use of participants who are both available and willing to be 
studied, the groups may not represent the population (Creswell, 2012). Small samples 
sizes, as well as an RS group with a greater number of males to females, also limited how 
reliably conclusions can follow data analysis or comparisons be made between the groups. 

Another limitation concerned the infrastructure at the research site. Some learners 
reported computer crashes or slow Internet, especially when trying to use RS and TMM. 
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This situation is not new in China. Zou (2011) explored CALL use among Chinese learners 
and found that differences in computer-based facilities (with fixed, teacher-centered 
setups or movable seats) posed challenges to implementing CALL uniformly. 

Implications and Future Research 

With English test performance remaining important to Chinese ELLs both in China and 
elsewhere, teachers and researchers need to continue to explore motivating and engaging 
activities. Zheng (2012) warned that what a teacher emphasizes may shape a learner’s 
opinion of what aspects of English deserve attention. This study indicated that text- and 
vocabulary-centered MEM and WOW might impact writing and speaking test 
performance more than RS and TMM, which promote themselves as one -stop solutions to 
language learning. MEM and WOW remain free, open-source tools available to any learner 
in the world with an Internet connection and therefore might represent the most potent 
supplemental tools for settings with limited resources. Future research could build on the 
findings here by using mixed-methods designs to paint more all-inclusive pictures of the 
performance and perceptions of learners using this study’s software, such as explanatory- 
sequential or multiphase designs. 
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